Abstract

This paper explores the usefulness of a technique
from software engineering, code instrumentation,
for the development of large-scale natural language
grammars. Information about the usage of grammar
rules in test and corpus sentences is used to
improve grammar and testsuite, and to adapt
a grammar to a specific genre. Results show that less
than half of a large-coverage grammar for German
is actually tested by two large testsuites, and that
10-30% of testing time is redundant. The methodology
applied can be seen as a re-use of grammar
writing knowledge for testsuite compilation. The
construction of genre-specific grammars results in
performance gains of a factor of four.

1 Introduction 

The field of Computational Linguistics (CL) has
moved both towards applications and towards large
data sets. These developments call for a rigorous
methodology for creating so-called lingware: linguistic
data such as lexica, grammars, and tree-banks, as
well as the software processing it. Experience from
Software Engineering has shown that the earlier
deficiencies are detected, the less costly their correction is.
Rather than being a post-development effort, quality
evaluation must be an integral part of development
to make the construction of lingware more efficient
(cf., e.g., (EAGLES, 1996) for a general evaluation
framework and (Ciravegna et al., 1998) for the
application of a particular software design methodology
to linguistic engineering). This paper presents
the adaptation of a particular Software Engineering
(SE) method, instrumentation, to Grammar Engineering
(GE). Instrumentation makes it possible to determine
which test item exercises a certain piece of (software
or grammar) code.

The paper first describes the use of instrumentation
in SE, then discusses possible realizations in
unification grammars, and finally presents two classes
of applications.

1 The experiments reported here were conducted during
my work at the Institut für Maschinelle Sprachverarbeitung
(IMS), Stuttgart University, Germany. I'd like to thank Jonas
Kuhn (IMS) and John Maxwell (Xerox PARC) for their help
in conducting these experiments.

2 Software Instrumentation 

Systematic software testing requires a match be- 
tween the test subject (module or complete system) 
and a test suite (collection of test items, i.e., sam- 
ple input). This match is usually computed as the 
percentage of code items exercised by the test suite. 
Depending on the definition of a code item, various
measures are employed, for example (cf. (Hetzel, 1988)
and (EAGLES, 1996, Appendix B) for overviews):



statement coverage percentage of single state- 
ments exercised 

branch coverage percentage of arcs exercised in 
control flow graph; subsumes statement cover- 
age 

path coverage percentage of paths exercised from 
start to end in control flow graph; subsumes 
branch coverage; impractical due to large (often 
infinite) number of paths 

condition coverage percentage of (simple or ag- 
gregate) conditions evaluated to both true and 
false (on different test items) 

Testsuites are constructed to maximize the tar- 
geted measure. A test run yields information about 
the code items not exercised, allowing the improve- 
ment of the testsuite. 

The measures are automatically obtained by in- 
strumentation: The test subject is extended by code 
which records the code items exercised during pro- 
cessing. After processing the testsuite, the records 
are used to compute the measures. 
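
The recording step can be illustrated with a minimal Python sketch of statement coverage. The test subject `classify`, the probe labels, and the helper names are hypothetical and chosen only for illustration; the probes are inserted by hand here, whereas instrumentation tools would normally insert them automatically.

```python
# Hand-instrumented toy test subject: each probe() call marks one group of
# statements (a "code item") as exercised. All names are illustrative.
CODE_ITEMS = {"entry", "negative", "zero", "positive"}
exercised = set()


def probe(item):
    """Record that a code item was reached during processing."""
    exercised.add(item)


def classify(n):
    probe("entry")
    if n < 0:
        probe("negative")
        return "negative"
    if n == 0:
        probe("zero")
        return "zero"
    probe("positive")
    return "positive"


def statement_coverage(test_suite):
    """Run all test items; return (coverage, code items not exercised)."""
    exercised.clear()
    for item in test_suite:
        classify(item)
    return len(exercised) / len(CODE_ITEMS), CODE_ITEMS - exercised


coverage, missing = statement_coverage([5, 17, -3])
print(coverage, missing)  # 0.75 {'zero'}
```

The set of items not exercised is exactly the information that tells the tester which test item to add next (here, an input of 0).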

3 Grammar Instrumentation 

Measures from SE cannot simply be transferred to
unification grammars, because the structure of
(imperative) programs is different from that of
(declarative) grammars. Nevertheless, the structure of a
grammar (formalism) allows us to define measures very
similar to those employed in SE.



constraint coverage is the quotient

\[ \frac{\#\ \text{constraints exercised}}{\#\ \text{constraints in grammar}} \]

where a constraint may be either a phrase-
structure or an equational constraint, depending
on the formalism.

disjunction coverage is the quotient

\[ T_{dis} = \frac{\#\ \text{disjunctions covered}}{\#\ \text{disjunctions in grammar}} \]

where a disjunction is considered covered when
all its alternative disjuncts have been separately
exercised. It encompasses constraint coverage.
Optional constituents and equations have to be
treated as a disjunction of the constraint and
an empty constraint (cf. Fig. 1 for an example).

interaction coverage is the quotient

\[ \frac{\#\ \text{disjunct combinations exercised}}{\#\ \text{legal disjunct combinations}} \]

where a disjunct combination is a complete set
of choices in the disjunctions which yields a well-
formed grammatical structure.
As with path coverage, the set of legal disjunct
combinations typically is infinite due to recursion.
A solution from SE is to restrict the use
of recursive rules to a fixed number of cases, for
example not using the rule at all, and using it
only once.

VP => V    ↓=↑;
      NP?  ↓=(↑ OBJ);
      PP*  { ↓=(↑ OBL);
           | ↓∈(↑ ADJUNCT); }.

Figure 1: Sample Rule

The goal of instrumentation is to obtain informa- 
tion about which test cases exercise which grammar 
constraints. One way to record this information is 
to extend the parsing algorithm. Another way is 
to use the grammar formalism itself to identify the 
disjuncts. Depending on the expressivity of the for- 
malism used, the following possibilities exist: 

atomic features Assuming a unique numbering 
of disjuncts, an annotation of the form 
DISJUNCT-nn = + can be used for marking. To 
determine whether a certain disjunct was used 
in constructing a solution, one only needs to 
check whether the associated feature occurs (at 
some level of embedding) in the solution. 

set-valued features If set-valued features are
available, one can use a set-valued feature
DISJUNCTS to collect atomic symbols
representing one disjunct each:
DISJUNCT-nn ∈ DISJUNCTS, which might
ease the collection of exercised disjuncts.

multiset of symbols To recover the number of
times a disjunct is used, one needs to leave the
unification paradigm, because it is very difficult
to count with unification grammars. The Xerox
Linguistic Environment used here (XLE; cf.
www.parc.xerox.com/istl/groups/nltt/pargram
and (Kaplan and Newman, 1997)) provides for
a multiset of symbols to be associated with each
complete structural analysis: Following the
LFG spirit of different projections, it defines a
projection of symbolic marks which is formally
equivalent to a multiset of symbols (cf. (Frank
et al., 1998) for an introduction and several
applications). Thus, one may recover the set of
all disjuncts used from each analysis, together
with their frequency.

Consider the LFG grammar rule in Fig. 1.² Constraint
coverage would require test items such that
every category in the VP is exercised; a sequence of
V NP PP would suffice for this measure. Disjunction
coverage additionally requires taking the empty disjuncts
into account: NP and PP are optional, so that four
items are needed to achieve full disjunction coverage
on the phrase structure part of the rule. Due to
the disjunction in the PP annotation, two more test
items are required to achieve full disjunction coverage
on the complete rule. Fig. 2 shows the rule from
Fig. 1 with instrumentation.

Figure 2: Instrumented rule
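
How the recorded marks could be turned into a coverage figure is sketched below in Python. The sketch assumes that every analysis is delivered as a multiset of disjunct marks; the disjunct labels, the table `DISJUNCTIONS`, and the function names are invented for illustration and are not part of XLE. It also assumes one particular reading of "covered": a disjunction counts as covered once each of its alternatives has been seen in some analysis of the testsuite.

```python
from collections import Counter

# Disjunctions of the sample VP rule with their alternative disjuncts.
# Optional categories are modelled as a choice between the category and
# an empty disjunct. The labels are hypothetical.
DISJUNCTIONS = {
    "NP?": {"NP", "NP-empty"},
    "PP*": {"PP", "PP-empty"},
    "PP-annotation": {"OBL", "ADJUNCT"},
}


def exercised_disjuncts(analyses):
    """Merge the per-analysis multisets of marks into one Counter."""
    total = Counter()
    for marks in analyses:
        total.update(marks)
    return total


def disjunction_coverage(analyses):
    """Share of disjunctions whose alternatives were all exercised."""
    seen = set(exercised_disjuncts(analyses))
    covered = [name for name, alts in DISJUNCTIONS.items() if alts <= seen]
    return len(covered) / len(DISJUNCTIONS)


def unexercised(analyses):
    """Disjuncts that no analysis in the testsuite relied on."""
    seen = set(exercised_disjuncts(analyses))
    return sorted(set().union(*DISJUNCTIONS.values()) - seen)


# One multiset per analysis: two V NP PP items, differing in the PP annotation.
analyses = [
    Counter({"NP": 1, "PP": 1, "OBL": 1}),
    Counter({"NP": 1, "PP": 1, "ADJUNCT": 1}),
]
print(exercised_disjuncts(analyses)["NP"])  # 2
print(unexercised(analyses))                # ['NP-empty', 'PP-empty']
```

Because the marks form a multiset, the same data yields both the set of disjuncts used and their frequency; only the former is needed for the coverage measure.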



2 Although the sample rules are in the format of LFG,
nothing of the methodology relies on the choice of linguistic
or computational paradigm. The notation: ?/*/+ represent
optionality/iteration including/excluding zero occurrences on
categories; e represents the empty string. Annotations to a
category specify equality (=) or set membership (∈) of
feature values, or non-existence of features (¬); they are
terminated by a semicolon (;). Disjunctions are given in braces
({...|...}). ↑ (↓) are metavariables representing the
feature structure corresponding to the mother (daughter) of the
rule. o* (for optimality) represents the sentence's multiset-
valued symbolic projection. Comments are enclosed in
quotation marks ("..."). Cf. (Kaplan and Bresnan, 1982) for an
introduction to LFG notation.



4 Grammar and Testsuite Improvement

Traditionally, a testsuite is used to improve (or
maintain) a grammar's quality (in terms of coverage
and overgeneration). Using instrumentation,
one may extend this usage by looking for sources of
overgeneration (cf. Sec. 4.3), and may also improve
the quality of the testsuite, in terms of coverage (cf.
Sec. 4.1) and economy (cf. Sec. 4.2).

Complementing other work on testsuite construction
(cf. Sec. 4.4), I will assume that a grammar
is already available, and that a testsuite has to be
constructed or extended. While one may argue that
grammar and testsuite should be developed in parallel,
such that the coding of a new grammar disjunct
is accompanied by the addition of suitable test cases,
and vice versa, this is seldom the case. Apart from
the existence of grammars which lack a testsuite,
there is the more principled obstacle of the evolution
of the grammar, leading to states where previously
necessary rules silently lose their usefulness,
because their function is taken over by some other
rules, structured differently. This is detectable by
instrumentation, as discussed in Sec. 4.1.

On the other hand, once there is a testsuite, it has
to be used economically, avoiding redundant tests.
Sec. 4.2 shows that there are different levels of
redundancy in a testsuite, dependent on the specific
grammar used. Reduction of this redundancy can
speed up the test activity, and give a clearer picture
of the grammar's performance.

4.1 Testsuite Completeness 

If the disjunction coverage of a testsuite is 1 for some
grammar, the testsuite is complete w.r.t. this grammar.
Such a testsuite can reliably be used to monitor
changes in the grammar: Any reduction in the
grammar's coverage will show up in the failure of
some test case (for negative test cases, cf. Sec. 4.3).

If the testsuite is not complete, instrumentation 
can identify disjuncts which are not exercised. These 
might be either (i) appropriate, but untested, dis- 
juncts calling for the addition of a test case, or (ii) in- 
appropriate disjuncts, for which a grammatical test 
case exercising them cannot be constructed. 

Experiments were based on a large German
LFG grammar developed at the IMS (cf.
www.ims.uni-stuttgart.de/projekte/pargram
and (Kuhn et al., 1998; Kuhn and Rohrer, 1997)).
We found that a testsuite of 1787 items collected to
support grammar development only exercised 1456
out of 3730 grammar disjuncts, yielding T_dis = 0.39.
The TSNLP testsuite containing 1093 items
exercised only 1081 disjuncts, yielding T_dis = 0.28.³



3 There are, of course, unparsed but grammatical test cases
in both testsuites, which have not been taken into account
in these figures. This explains the difference from the overall
number of 1582 items in the German TSNLP testsuite.

PPstd => Pprae ↓=↑;
         NPstd ↓=(↑ OBJ);
         { e DISJUNCT-011 ∈ o*;
         | Pcircum ↓=↑;
           DISJUNCT-012 ∈ o*
           "unused disjunct"; }

Figure 3: Appropriate untested disjunct

ADVP => { { e DISJUNCT-021 ∈ o*;
          | ADVadj ↓=↑
            DISJUNCT-022 ∈ o*
            "unused disjunct"; }
          ADVstd ↓=↑
          DISJUNCT-023 ∈ o*
          "unused disjunct"; }
        | ... }.

Figure 4: Inappropriate disjunct

Fig. 3 shows an example of a gap in our testsuite
(there are no examples of circumpositions), while
Fig. 4 shows an inappropriate disjunct thus discovered
(the category ADVadj has been eliminated in
the lexicon, but not in all rules).

4.2 Testsuite Economy

Besides being complete, a testsuite must be economical,
i.e., contain as few items as possible. Instrumentation
can identify redundant test cases, where
redundancy can be defined in three ways:

similarity There is a set of other test cases which
jointly exercise all disjuncts which the test case
under consideration exercises.

equivalence There is a single test case which
exercises exactly the same combination(s) of
disjuncts.

strict equivalence There is a single test case
which is equivalent to and, additionally, exercises
the disjuncts exactly as often as, the test
case under consideration.

Fig. 5 shows equivalent test cases found in our
testsuite: Example 1 illustrates the distinction
between equivalence and strict equivalence; the test
cases contain different numbers of attributive adjectives.
Example 2 shows that our grammar does not
make any distinction between adverbial usage and
secondary (subject or object) predication.

The achievable reduction in size and processing
time is shown in Table 1, which contains measurements
for a test run containing only the parseable
test cases, one without equivalent test cases (for every
set of equivalent test cases, one was arbitrarily
selected), and one without similar test cases.
The last was constructed using a simple heuristic:



1  ein guter alter Wein
   ein guter alter trockener Wein
   'a good old (dry) wine'

2  Er ißt das Schnitzel roh.
   Er ißt das Schnitzel nackt.
   Er ißt das Schnitzel schnell.
   'He eats the schnitzel naked/raw/quickly.'

Figure 5: Sets of equivalent test cases



                     test    relative   runtime   relative
                     cases   size       (sec)     runtime

TSNLP testsuite
  parseable          1093    100%       1537      100%
  no equivalents      783     71%        665.3     43%
  no similar cases    214     19%        128.5      8%

local testsuite
  parseable          1787    100%       1213      100%
  no equivalents     1600     89%        899.5     74%
  no similar cases    331     18%        175.0     14%

Table 1: Reduction of Testsuites

Starting with the sentence exercising the most dis- 
juncts, working towards sentences relying on fewer 
disjuncts, a sentence was selected only if it exercised 
a disjunct which no previously selected sentence ex- 
ercised. Assuming that a disjunct working correctly 
once will work correctly more than once, we did not 
consider strict equivalence. 
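
The two reductions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for Table 1: each test case is represented only by the set of disjuncts it exercises, so equivalence is approximated by identical disjunct sets rather than identical disjunct combinations, and the disjunct identifiers `D1` to `D4` are made up.

```python
def reduce_by_similarity(test_cases):
    """Greedy heuristic: keep a sentence only if it adds a new disjunct.

    test_cases maps each sentence to the set of disjuncts it exercises.
    """
    # Start with the sentence exercising the most disjuncts.
    ordered = sorted(test_cases, key=lambda s: len(test_cases[s]), reverse=True)
    covered, selected = set(), []
    for sentence in ordered:
        if test_cases[sentence] - covered:
            selected.append(sentence)
            covered |= test_cases[sentence]
    return selected


def reduce_by_equivalence(test_cases):
    """Keep one arbitrary representative per identical disjunct set."""
    representatives = {}
    for sentence, disjuncts in test_cases.items():
        representatives.setdefault(frozenset(disjuncts), sentence)
    return list(representatives.values())


# Hypothetical disjunct sets for four test items.
suite = {
    "ein guter alter Wein": {"D1", "D2", "D3"},
    "ein guter alter trockener Wein": {"D1", "D2", "D3"},
    "ein Wein": {"D1", "D2"},
    "Er schläft.": {"D4"},
}
print(reduce_by_equivalence(suite))  # three representatives remain
print(reduce_by_similarity(suite))   # ['ein guter alter Wein', 'Er schläft.']
```

The greedy pass does not guarantee a minimal testsuite, but every disjunct exercised by the full testsuite is still exercised by the selection.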

We envisage the following use of this redundancy
detection: There clearly are linguistic reasons to
distinguish all test cases in example 2, so they cannot
simply be deleted from the testsuite. Rather, their
equivalence indicates that the grammar is not yet
perfect (or never will be, if it remains purely
syntactic). Such equivalences could be interpreted as
a reminder of which linguistic distinctions need to be
incorporated into the grammar. Thus, this level of
redundancy may drive the grammar development
agenda. The level of equivalence can be taken as
a limited interaction test: These test cases represent
one complete selection of grammar disjuncts,
and (given the grammar) there is nothing we can
gain by checking a test case if an equivalent one was
tested. Thus, this level of redundancy may be used
for ensuring the quality of grammar changes prior
to their incorporation into the production version of
the grammar. The level of similarity contains far
fewer test cases, and does not test any (systematic)
interaction between disjuncts. Thus, it may be used
during development as a quick rule-of-thumb procedure
detecting serious errors only.

4.3 Sources of Overgeneration

To control overgeneration, appropriately marked
ungrammatical sentences are important in every
testsuite. Instrumentation as proposed here only looks
at successful parses, but can still be applied in this
context: If an ungrammatical test case receives an
analysis, instrumentation informs us about the
disjuncts used in the incorrect analysis. At least one of
these disjuncts must be incorrect, or the sentence would
not have received a solution. We exploit this information
by accumulation across the entire test suite,
looking for disjuncts that appear in unusually high
proportion in parseable ungrammatical test cases.

Der Test fällt leicht.        Dieselbe schlafen.
Die schlafen.                 Das schlafen.
                              Eines schlafen.
Man schlafen.                 Jede schlafen.
Dieser schlafen.              Dieses schlafen.
Ich schlafen.                 Eine schlafen.
Der schlafen.                 Meins schlafen.
Jeder schlafen.               Dasjenige schlafen.
Derjenige schlafen.           Jedes schlafen.
Jener schlafen.               Diejenige schlafen.
Keiner schlafen.              Jenes schlafen.
Derselbe schlafen.            Keines schlafen.
Er schlafen.                  Dasselbe schlafen.
Irgendjemand schlafen.

Figure 6: Sentences relying on suspicious disjunct

In this manner, six grammar disjuncts are singled
out by the parseable ungrammatical test cases in
the TSNLP testsuite. The most prominent disjunct
appears in 26 sentences (listed in Fig. 6), of which
the top left group is indeed grammatical and the
rest fall into two classes: a partial VP with object
NP, interpreted as an imperative sentence (bottom
left), and a weird interaction with the tokenizer
incorrectly handling capitalization (right group).

Although far from conclusive, these results are
encouraging: The sentences singled out by a suspicious
grammar disjunct are similar to each other, and they
relate clearly to only two exactly specifiable grammar
errors. This makes it plausible that the approach is
very promising for detecting the sources of overgeneration.
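
The accumulation across the testsuite described in this section might look as follows in Python. The input format, the function name, and the two cut-off parameters `threshold` and `min_count` are assumptions made for this sketch; the text above does not specify how "unusually high proportion" was operationalized.

```python
from collections import Counter


def suspicious_disjuncts(parsed_cases, threshold=0.5, min_count=2):
    """Rank disjuncts by their share of use in ungrammatical parsed items.

    parsed_cases is a list of (is_grammatical, disjuncts) pairs, one per
    test case that received an analysis.
    """
    total, bad = Counter(), Counter()
    for is_grammatical, disjuncts in parsed_cases:
        for d in set(disjuncts):
            total[d] += 1
            if not is_grammatical:
                bad[d] += 1
    ranked = [
        (d, bad[d] / total[d])
        for d in total
        if bad[d] >= min_count and bad[d] / total[d] >= threshold
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical parse results: D7 only ever occurs in ungrammatical items.
cases = [
    (True, {"D1", "D2"}),
    (True, {"D1", "D3"}),
    (False, {"D1", "D7"}),
    (False, {"D2", "D7"}),
    (True, {"D2", "D3"}),
]
print(suspicious_disjuncts(cases))  # [('D7', 1.0)]
```

A disjunct flagged this way is only a candidate: as the top left group of Fig. 6 shows, the sentences relying on it still have to be inspected by hand.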

4.4 Other Approaches to Testsuite Construction

The delicacy of testsuite construction is acknowledged
in (EAGLES, 1996, p. 37). Although there
are a number of efforts to construct reusable
testsuites, none has to my knowledge explored how
existing grammars can be exploited.

Starting with (Flickinger et al., 1987), testsuites
have been drawn up from a linguistic viewpoint,
"informed by [the] study of linguistics and [reflecting]
the grammatical issues that linguists have concerned
themselves with" (Flickinger et al., 1987, p. 4).
Although the question is not explicitly addressed in
(Estival et al., 1994), all the testsuites reviewed there
also seem to follow the same methodology. The
TSNLP project (Lehmann and Oepen, 1996) and
its successor DiET (Netter et al., 1998), which built
large multilingual testsuites, likewise fall into this
category.

The use of corpora (with various levels of annotation)
has been studied, but the recommendations are
that much manual work is required to turn corpus
examples into test cases (e.g., (Balkan and Fouvry,
1995)). The reason given is that corpus sentences
neither contain linguistic phenomena in isolation,
nor do they contain systematic variation. Corpora
thus are used only as an inspiration.

(Oepen and Flickinger, 1998) stress the
interdependence between application and testsuite, but
do not comment on the relation between grammar
and testsuite.

5 Genre Adaptation

A different application of instrumentation is the
tailoring of a general grammar to specific genres.
All-purpose grammars are plagued by lexical and
structural ambiguity that leads to overly long runtimes.
If this ambiguity could be limited, parsing efficiency
would improve. Instrumenting a general grammar
makes it possible to automatically derive specialized
subgrammars based on sample corpora. This setup has
several advantages: The larger the overlap between
genres, the larger the portion of grammar development
work that can be recycled. The all-purpose grammar
is linguistically more interesting, because it requires
an integrated concept, as opposed to several separate
genre-specific grammars.

I will discuss two ways of improving the efficiency
of parsing a sublanguage, given an all-purpose
unification grammar. The first consists in deleting
unused disjuncts, while the second uses a staged parsing
process. The experiments are only sketched,
to indicate the applicability of the instrumentation
technique, and not to directly compete with other
proposals on grammar specialization. For example,
the work reported in (Rayner and Samuelsson, 1994;
Samuelsson, 1994) differs from the one presented below
in several aspects: They induce a grammar from
a treebank, while I propose to annotate the grammar
based on all solutions it produces. No criteria
for tree decomposition and category specialization
are needed here, and the standard parsing algorithm
can be used. On the other hand, the efficiency gains
are not as big as those reported by (Rayner and
Samuelsson, 1994).

5.1 Restricting the Grammar

Given a large sample of a genre, instrumentation
makes it possible to determine the likely constructions
of that genre. Eliminating unused disjuncts allows faster
parsing due to a smaller grammar. An experiment
was conducted with several corpora as detailed in
Table 2. There was some effort to cover the corpus
HC-DE, but no grammar development based on the
other corpora. The NEWS-SC corpus is part of the
corpus of verb-final sentences used by (Beil et al.,
1999).

Descriptor   Content                            Coverage

HC-DE        Copier/Printer User Manual         89%
WHB          Car Maintenance Instructions       76%
NEWS         News (5-30 words per sentence)     42%
NEWS-SC      Verb-final subclauses from News    75%

Table 2: Corpora used for adaptation



A training set of 1000 sentences from each cor- 
pus was parsed with an instrumented base gram- 
mar. From the parsing results, the exercised gram- 
mar disjuncts were extracted and used to construct 
a corpus-specific reduced grammar. The reduced 
grammars were then used to parse a test set of
another 1000 sentences from each corpus. Table 3
shows the performance improvement on the corpora:
It gives the size of the grammars in terms of the 
number of rules (with regular expression right-hand 
sides and feature annotation), the number of arcs 
(corresponding to unary or binary rules with dis- 
junctive feature annotation), and the number of dis- 
juncts (unary or binary rules with unique feature 
annotation). The number of mismatches counts the 
sentences for which the solution(s) obtained differed 
from those obtained with the base grammar, while 
the number of additions counts the sentences which 
did not receive a parse with the base grammar due 
to resource limitations (runtime or memory), but re- 
ceived one with the reduced grammar. The other 
columns give timings to process the total corpus, 
and the longest and average processing time per sen- 
tence; time is in seconds. The last column gives the 
average number of solutions per sentence. 
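
The construction of a reduced grammar from the training results can be sketched in a few lines of Python. The representation is an assumption made for illustration: the base grammar is modelled as a mapping from hypothetical disjunct identifiers to opaque definitions, and each training analysis as a collection of the marks it produced.

```python
def build_reduced_grammar(base_grammar, training_marks):
    """Keep only the disjuncts exercised by some training analysis.

    base_grammar maps a disjunct identifier to its (opaque) definition;
    training_marks is an iterable of mark collections, one per analysis.
    """
    used = set()
    for marks in training_marks:
        used.update(marks)
    return {d: body for d, body in base_grammar.items() if d in used}


# Hypothetical four-disjunct grammar and the marks of two training analyses.
base = {"D1": "V", "D2": "NP", "D3": "NP-empty", "D4": "Pcircum"}
training = [["D1", "D2"], ["D1", "D3", "D3"]]
reduced = build_reduced_grammar(base, training)
print(sorted(reduced), len(reduced) / len(base))  # ['D1', 'D2', 'D3'] 0.75
```

Frequencies are ignored at this point; a disjunct survives as soon as it is used once in any solution of the training set.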

Due to the sampling of a genre, the grammars
obtained can only be approximate. To determine
the relation of the sample size to the quality of the
grammar obtained, the coverage of random fragment
grammars was measured in the following way:
Randomly select a number of sentences from the
total corpus, construct (in the same way as described
above for the reduced grammar) a fragment grammar,
and determine its coverage on the test set from
the respective corpus. The graphs in Fig. 7 show
how the coverage and runtime relate to the number
of sentences on which the fragment grammars are
based. The leftmost data point (x value 0) describes
the performance of the reduced grammar on the
training set, while the rightmost data point describes
its performance on the test set. The data points in
between represent fragment grammars based on as
many sentences as given by the x axis value.
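
The sampling procedure can be simulated without a parser under one simplifying assumption, which is made only for this Python sketch: a fragment grammar is taken to cover a test sentence if at least one base-grammar analysis of that sentence uses only disjuncts contained in the fragment. All data and names below are hypothetical.

```python
import random


def fragment_grammar(sample):
    """Union of all disjuncts used by any analysis of the sampled sentences."""
    return {d for analyses in sample for marks in analyses for d in marks}


def coverage(fragment, test_set):
    """Share of test sentences with an analysis lying inside the fragment."""
    parsed = sum(
        1 for analyses in test_set
        if any(set(marks) <= fragment for marks in analyses)
    )
    return parsed / len(test_set)


def coverage_curve(corpus, test_set, sizes, seed=0):
    """Coverage of one random fragment grammar per requested sample size."""
    rng = random.Random(seed)
    return [
        (n, coverage(fragment_grammar(rng.sample(corpus, n)), test_set))
        for n in sizes
    ]


# Each sentence is a list of analyses; each analysis is a set of disjuncts.
corpus = [[{"D1", "D2"}], [{"D1", "D3"}, {"D4"}], [{"D5"}]]
test_set = [[{"D1"}], [{"D2", "D3"}], [{"D6"}]]
print(coverage_curve(corpus, test_set, [0, 3]))
```

With real data, one would repeat the sampling several times per size and average, since a single random sample can be unrepresentative.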

The results reported here represent the minimal
performance gain, because the construction of reduced
and fragment grammars is not based on the correct
solutions for the training sentences, but rather on all
solutions produced by the base grammar. The
construction of a large-scale treebank with manually
verified solutions is under way but has not yet
progressed far enough to serve as input for this
experiment. Even with this systematic, but curable error,
the reduction cuts overall processing time by a factor
of four. The number of solutions is constant because
only unused disjuncts are eliminated; this will change
if the treebank solutions are used to construct the
reduced grammar.

                         # of rules /        # of mismatches /   total time / max. time per sentence /
                         arcs / disjuncts    additions           avg. time per sentence

Corpus HC-DE
  base grammar           185 3669 11564      n/a n/a             7692.4 >300 7.1
  reduced grammar (938)  112 960 3739        1                   2089.4 162.7 1.9

Corpus WHB
  base grammar           195 3728 11606      n/a                 1428.9 >300.3 1.5
  reduced grammar (559)  534 3072            1                   444.2 11.3 0.4

Table 3: Performance of reduced grammars

Figure 7: Performance of fragment grammars

5.2 Staged Parsing 

Even eliminating only unlikely disjuncts necessarily 
reduces the coverage of the grammar. A sequence of 
parsing stages allows one to profit from a small and 
fast grammar as well as from a large and slow one. 
Staged parsing applies different grammars one after 
the other to the input, until one yields a solution, 
which terminates the process. In our case, a gram- 
mar of stage n + 1 includes the grammar of stage n, 
but this need not be the case in general. 

To reduce the variability for an experiment, I
assume three stages: The first includes frequently used
disjuncts, the second all used disjuncts, and the third
all grammar disjuncts. This ensures the same coverage
as the base grammar, but allows the first parsing
stage to focus on frequent constructions. The
procedure is similar to the one described before: From
the solutions of a training set, a staged grammar is
constructed. Currently, experiments are performed to
determine a useful definition of 'frequently used'.
Independently of the actual performance gains finally
obtained, the application of instrumentation allows a
systematic exploration of the possible configurations.
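
The control flow of staged parsing can be sketched in Python as follows. Everything here is illustrative: `parse` stands for whatever parser is available, `toy_parse` is a stand-in that treats a sentence as the set of disjuncts it requires, and `min_count` is a placeholder for the definition of 'frequently used', which is left open above.

```python
def staged_parse(sentence, stages, parse):
    """Apply the grammars in order; the first one with a solution wins.

    parse(grammar, sentence) is assumed to return a list of solutions,
    empty if the grammar does not cover the sentence.
    """
    for stage_number, grammar in enumerate(stages, start=1):
        solutions = parse(grammar, sentence)
        if solutions:
            return stage_number, solutions
    return None, []


def build_stages(disjunct_counts, all_disjuncts, min_count):
    """Three nested stages: frequent, used, and all grammar disjuncts."""
    frequent = {d for d, c in disjunct_counts.items() if c >= min_count}
    used = {d for d, c in disjunct_counts.items() if c > 0}
    return [frequent, used, set(all_disjuncts)]


def toy_parse(grammar, required):
    """Stand-in parser: succeeds if all required disjuncts are available."""
    return [sorted(required)] if required <= grammar else []


# Hypothetical usage counts from a training set.
counts = {"D1": 40, "D2": 3, "D3": 0}
stages = build_stages(counts, ["D1", "D2", "D3", "D4"], min_count=10)
print(staged_parse({"D1"}, stages, toy_parse))        # (1, [['D1']])
print(staged_parse({"D1", "D4"}, stages, toy_parse))  # (3, [['D1', 'D4']])
```

Since the last stage contains all grammar disjuncts, every sentence covered by the base grammar is still parsed; the cost of a miss is the time spent in the earlier stages.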

5.3 Other approaches to grammar adaptation

(Rayner and Samuelsson, 1994; Rayner and Carter,
1996; Samuelsson, 1994) present a grammar
specialization technique for unification grammars. From a
treebank of the sublanguage, they induce a specialized
grammar using fewer macro rules which correspond
to the application of several original rules.
They report an average speed-up of 55 for only the
parsing phase (taking lexical lookup into account,
the speed-up factor was only 6-10). Due to the
derivation of the grammar from a corpus sample,
they observed a decrease in recall of 7.3% and an
increase of precision of 1.6%. The differences from the
approach described here are clear: Starting from the
grammar, rather than from a treebank, we annotate
the rules, rather than inducing them from scratch.
We do not need criteria for tree decomposition and
category specialization, and we can use the standard
parsing algorithm. On the other hand, the efficiency
gains are not as big as those reported by (Rayner
and Carter, 1996) (but note that we cannot measure
parsing times alone, so we need to compare to their
speed-up factor of 10). Moreover, we did not (yet) start
from a treebank, but from the raw set of solutions.

6 Conclusion 

I have presented the adaptation of code instrumenta- 
tion to Grammar Engineering, discussing measures 
and implementations, and sketching several applica- 
tions together with preliminary results. 

The main application is to improve grammar and 
testsuite by exploring the relation between both of 
them. Viewed this way, testsuite writing can ben- 
efit from grammar development because both de- 
scribe the syntactic constructions of a natural lan- 
guage. Testsuites systematically list these construc- 
tions, while grammars give generative procedures 
to construct them. Since there are currently many 
more grammars than testsuites, we may re-use the 
work that has gone into the grammars for the im- 
provement of testsuites. 

Other applications of instrumentation are possi- 
ble; genre adaptation was discussed in some depth. 
On a more general level, one may ask whether other 
methods from SE may fruitfully apply to GE as well, 
possibly in modified form. For example, the static 
analysis of programs, e.g., detection of unreachable 
code, could also be applied for grammar develop- 
ment to detect unusable rules. 